In-class_Ex5

Author

Shubhanshi

Setting the scene

To build an exploratory model to discover factor affecting water point status in Osun State, Nigeria.

Study are: Osun State, Nigeria

Data Sets:

Osun.rds, contains LGAs boundaries of Osun State. It is in sf polygon data frame, and

Osun_wp_sf.rds, contains water points within Osun State. It is in sf point data frame.

Model Variables

  • Dependent Variables: Water point status(ie. functional/non-functional)

  • Independent Variables:

    • distance_to_primary_road

    • distance_to_secondary_road

    • distance_to_tertiary_road

    • distance_to_city

    • distance_to_town

    • water_point_population

    • local_population_1Km

    • usage_capacity

    • is_urban

    • water_source_clean

Installing R Packages

Using the code chunk, following packages will be installed into R environment

pacman::p_load(sf, spdep, tmap, tidyverse,funModeling,blorr,corrplot,ggpubr,GWmodel, skimr, caret)

Data Import

In this class exercise, two data sets will be used.They are:

Importing analytical data

First, we are going to import the analytical data into R environment.

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")
Osun_wp_sf %>%
  freq(input="status")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

From the above chart, it can interpreted that there are 2642 observation of “Functional water points” and 2118 observations of “Non-Functional Water points”.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun)+
  tm_polygons(alpha=0.4) + 
tm_shape(Osun_wp_sf) +
  tm_dots(col="status",
          alpha = 0.6) +
  tm_view (set.zoom.limits = c(9,12))

Exploratory Data Analysis

Summary Statistics with Skimr

Osun_wp_sf%>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
Osun_wp_sf_clean <- Osun_wp_sf%>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.)))%>%
  mutate(usage_capacity = as.factor(usage_capacity))

After the above code chunk run, it can be observed 4 observations are deleted and now there are total of 4756 observations with 75 columns.

Correlation Analysis

Osun_wp <- Osun_wp_sf_clean%>%
  select(c(7,35:39,42,43,46,47,57))%>%
  st_set_geometry(NULL)
cluster_vars.cor= cor(
  Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower= "ellipse",
               upper= "number",
               tl.pos= "lt",
               diag= "l",
               tl.col= "black")

From the above result, it can observed there are none of the variables that are highly correlated, i.e. correlation greater than +/- 0.8. Therefore, we will consider all the variables for the further analysis.

Building a Logistic Regression Models

model <- glm(status ~ distance_to_primary_road+
               distance_to_secondary_road+
               distance_to_tertiary_road+
               distance_to_city+
               distance_to_town+
               is_urban+
               usage_capacity+
               water_source_clean+
               water_point_population+
              local_population_1km,
             data= Osun_wp_sf_clean,
             family= binomial(link= "logit"))

Instead of using typical R report, blr_regress() of blorr package is used.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

From the model report, we observed that distance_to_primary_road and distance_to_secondary_road are not statistically significant (p value > 0.05). Therefore, these variables will be excluded for further analysis as they are not significant.

model_adjust <- glm(status ~
               distance_to_tertiary_road+
               distance_to_city+
               distance_to_town+
               is_urban+
               usage_capacity+
               water_source_clean+
               water_point_population+
               local_population_1km,
             data=Osun_wp_sf_clean,
             family=binomial(link="logit"))
blr_regress(model_adjust)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------
blr_confusion_matrix(model, cutoff= 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

The validity of a cutoff is measured using sensitivity, specificity and accuracy.

  1. Sensitivity: The % of correctly classified events out of all events= TP/(TP+FN)
  2. Specificity: The % of correctly classified non-events out of all events= TN/(TN+FP)
  3. Accuracy: The % of correctly classified observation over all observations= (TP+TN)/ (TP+FP+FN+TN)
blr_confusion_matrix(model_adjust,cutoff=0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

The updated version of the first model yields slightly lower results of 0.7188 and 0.6149, while the original model is able to attain sensitivity and specificity values of 0.7207 and 0.6154.

These are good results, but they can be made much better by taking geographic considerations into account. In the following section, we will examine how to perform logistic regression that is weighted spatially.

Building Fixed Bandwidth GWR Model

Converting sf data frame to sp data frame

Osun_wp_sp <- Osun_wp_sf_clean%>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           is_urban,
           usage_capacity,
           water_source_clean))%>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Computing Fixed Bandwidth

Adjusted one:

bw.fixed <- bw.ggwr(status ~
               distance_to_tertiary_road+
               distance_to_city+
               distance_to_town+
               is_urban+
               usage_capacity+
               water_source_clean+
               water_point_population+
               local_population_1km,
             data=Osun_wp_sp,
             family="binomial",
             approach = "AIC",
             kernel = "gaussian",
             adaptive = FALSE,
             longlat = FALSE)
bw.fixed
gwlr.fixed <- ggwr.basic(status ~
                      distance_to_primary_road+
                      distance_to_secondary_road+
                      distance_to_tertiary_road+
                      distance_to_city+
                      distance_to_town+
                      water_point_population+
                      local_population_1km+
                      is_urban+
                      usage_capacity+
                      water_source_clean,
                    data= Osun_wp_sp,
                    bw= 2377.371,
                    family= "binomial",
                    kernel= "gaussian",
                    adaptive= FALSE,
                    longlat= FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1879 
       1        -1579 
       2        -1416 
       3        -1322 
       4        -1281 
       5        -1281 

Model Assessment

Converting SDF into sf data.frame

To assess the performance of the gwLR, firstly, we will convert the SDF object in as data frame by using the code chunk below.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label that values greater or equal to 0.5 into 1, else 0. The result the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>%
  mutate(most= ifelse(
    gwr.fixed$yhat >= 0.5, T, F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference= gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1858  226
     TRUE    256 2416
                                          
               Accuracy : 0.8987          
                 95% CI : (0.8897, 0.9071)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7945          
                                          
 Mcnemar's Test P-Value : 0.1865          
                                          
            Sensitivity : 0.8789          
            Specificity : 0.9145          
         Pos Pred Value : 0.8916          
         Neg Pred Value : 0.9042          
             Prevalence : 0.4445          
         Detection Rate : 0.3907          
   Detection Prevalence : 0.4382          
      Balanced Accuracy : 0.8967          
                                          
       'Positive' Class : FALSE           
                                          

Geographical information is added to the logistic regression model, which considerably improves the model’s findings. The model’s overall accuracy is 0.8846 with improved sensitivity and specificity scores of 0.8789 and 0.9145, respectively. This is a significant improvement over the prior model without any consideration of geography.

Visualizing gwLR

Next, we will display the values of yhat from GWLR model using interactive tmap mode. The following columns will be first selected to facilitate the plotting of the map.

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN,ADM2_PCODE,ADM1_EN,ADM1_PCODE,status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected,gwr.fixed)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha=0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9,12))

prob_T
tertiary_TV <- tm_shape(Osun)+
  tm_polygons(alpha=0.1)+
  tm_shape(gwr_sf.fixed)+
  tm_dots(col="distance_to_tertiary_road_TV",
          border.col="gray60",
          border.lwd=1)+
  tm_view(set.zoom.limits=c(9,12))
tertiary_TV
Variable(s) "distance_to_tertiary_road_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.